22 research outputs found
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
Robust Development of Dependable Software Systems
The indissoluble bonds of computers and failures have produced a plurality of fault tolerant techniques to satisfy, potentially, any dependability requirement. As a consequence, the development of dependable systems is not based on inventing the mechanism that provides the desired dependability guarantees. Rather, it is based on selecting from the existing techniques the one that best meets the system's dependability requirements. Then, some codiøcation of the selected technique can be used as the search-key for retrieving from a repository of fault tolerant mechanisms the one that implements the selected technique. Hence, the development of dependable systems becomes a process that transforms a set of dependability constraints into a fault tolerant mechanism that meets them. The focus of our work is to ensure the rigorous development of dependable systems by creating a formal basis for the aforementioned selection process. More precisely, this formal basis consists of: a system mode..
Fault Tolerant Software Architectures
Coping explicitly with failures during the conception and the design of software development complicates significantly the designer's job. The design complexity leads to software descriptions difficult to understand, which have to undergo many simplifications until their first functioning version. To support the systematic development of complex, fault tolerant software, this paper proposes a layered framework for the analysis of the fault tolerance software properties, where the top-most layer provides the means for specifying the abstract failure semantics expressed in the initial conception stage, and each successive layer is a refinement towards an elaborated description of a fault tolerant software architecture. We present the logical vehicle that permits reasoning on the equivalence or the compatibility of the various expressions of fault tolerance properties at various abstraction levels. In addition, we propose a mapping schema, which permits the correct transformation of abstract entities into concrete ones, during a refinement process